Fully Learnable Front-End for Multi-Channel Acoustic Modeling using Semi-Supervised Learning
In this work, we investigated the teacher-student training paradigm to train
a fully learnable multi-channel acoustic model for far-field automatic speech
recognition (ASR). Using a large offline teacher model trained on beamformed
audio, we trained a simpler multi-channel student acoustic model used in the
speech recognition system. For the student, both multi-channel feature
extraction layers and the higher classification layers were jointly trained
using the logits from the teacher model. In our experiments, compared to a
baseline model trained on about 600 hours of transcribed data, a relative
word-error rate (WER) reduction of about 27.3% was achieved when using an
additional 1800 hours of untranscribed data. We also investigated the benefit
of pre-training the multi-channel front end to output the beamformed log-mel
filter bank energies (LFBE) using L2 loss. We find that pre-training improves
the word error rate by 10.7% when compared to a multi-channel model directly
initialized with a beamformer and mel-filter bank coefficients for the front
end. Finally, combining pre-training and teacher-student training produces a
WER reduction of 31% compared to our baseline.
Comment: To appear in ICASSP 202
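The teacher-student step described above, in which the student is trained on the teacher's logits, can be illustrated with a generic soft-target cross-entropy. This is only a minimal sketch, not the authors' training code: the function names, the `temperature` parameter, and the toy batch shape are assumptions introduced for illustration, and the abstract does not say which exact loss was used.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Mean cross-entropy of the student against the teacher's soft targets.

    `temperature` is a hypothetical knob; 1.0 leaves the logits unchanged.
    """
    teacher_probs = np.exp(log_softmax(teacher_logits / temperature))
    student_log_probs = log_softmax(student_logits / temperature)
    return float(-(teacher_probs * student_log_probs).sum(axis=-1).mean())

# Toy batch: 4 frames, 3 output classes (both sizes are made up).
rng = np.random.default_rng(0)
teacher_logits = rng.normal(size=(4, 3))

# A student that reproduces the teacher reaches the minimum of this loss;
# a student with negated logits is penalised more heavily.
matched = distillation_loss(teacher_logits, teacher_logits)
mismatched = distillation_loss(-teacher_logits, teacher_logits)
print(matched < mismatched)
```

Because the targets come from the teacher rather than from transcripts, a loss of this kind can be computed on untranscribed audio, which is what lets the additional 1800 hours contribute to training.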
Capacity and Coding for 2D Channels
Consider a piece of information printed on paper and scanned in the form of an
image. The printer, the scanner, and the paper naturally form a communication channel,
where the printer acts as the sender, the scanner as the receiver,
and the paper as the medium of communication. The channel created in this way is
quite complicated and it maps 2D input patterns to 2D output patterns. Inter-symbol
interference is introduced in the channel as a result of printing and scanning. During
printing, ink from the neighboring pixels can spread out. The scanning process can
introduce interference in the data obtained because of the finite size of each pixel and
the fact that the scanner doesn't have infinite resolution. Other degradations in the
process can be modeled as noise in the system. The scanner may also introduce some
spherical aberration due to the lensing effect. Finally, when the image is scanned,
it might not be aligned exactly below the scanner, which may lead to rotation and
translation of the image.
In this work, we present a coding scheme for the channel, and possible solutions for a
few of the distortions stated above. Our solution consists of the structure, encoding
and decoding scheme for the code, a scheme to undo the rotational distortion, and
an equalization method.
The motivation behind this work is the question: what is the information capacity of paper?
The purpose is to find out how much data can be printed out and retrieved
successfully. This question has potential practical impact on the design of
2D bar codes, which is why encodability is a desired feature. There are, however,
a number of other useful applications.
We successfully decoded 41.435 kB of data printed on paper of size 6.7 × 6.7
inches using a Xerox Phasor 550 printer and a Canon CanoScan LiDE200 scanner. As
described in the last chapter, the capacity of the paper using this channel is clearly
greater than 0.9230 kB per square inch. The main contribution of the thesis lies in
constructing the entire system and testing its performance. Since the focus is on
encodable and practically implementable schemes, the proposed encoding method is
compared with another well-known and easily encodable code, namely the
repeat-accumulate code.
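The density figure quoted above follows directly from the reported numbers. The short check below simply divides the decoded payload by the printed area; the variable names are illustrative, and the values are the ones given in the abstract.

```python
# Figures quoted in the abstract.
decoded_kb = 41.435   # kB successfully decoded
side_inches = 6.7     # printed area is 6.7 x 6.7 inches

area_sq_in = side_inches ** 2
density_kb_per_sq_in = decoded_kb / area_sq_in

print(f"{area_sq_in:.2f} sq in, {density_kb_per_sq_in:.4f} kB per sq in")
# prints "44.89 sq in, 0.9230 kB per sq in"
```

Since this payload was actually recovered, the result is a demonstrated lower bound on the capacity of this printer-scanner channel rather than an estimate of the capacity itself.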
Cross-utterance ASR Rescoring with Graph-based Label Propagation
We propose a novel approach for ASR N-best hypothesis rescoring with
graph-based label propagation by leveraging cross-utterance acoustic
similarity. In contrast to conventional neural language model (LM) based ASR
rescoring/reranking models, our approach focuses on acoustic information and
conducts the rescoring collaboratively among utterances, instead of
individually. Experiments on the VCTK dataset demonstrate that our approach
consistently improves ASR performance, as well as fairness across speaker
groups with different accents. Our approach provides a low-cost solution for
mitigating the majoritarian bias of ASR systems, without the need to train new
domain- or accent-specific models.
Comment: To appear in IEEE ICASSP 202
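The abstract does not describe how the graph is built or how N-best hypotheses are mapped onto it, so the following is only a generic illustration of graph-based label propagation, not the authors' method. It uses the common normalized update F ← αSF + (1 − α)Y, where the similarity matrix, the shared two-label score table, `alpha`, and `num_iters` are all assumptions chosen for the toy example.

```python
import numpy as np

def propagate_scores(similarity, initial_scores, alpha=0.5, num_iters=50):
    """Spread scores over a graph: F <- alpha * S @ F + (1 - alpha) * Y."""
    W = np.array(similarity, dtype=float)
    np.fill_diagonal(W, 0.0)                       # ignore self-similarity
    degree = W.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(np.maximum(degree, 1e-12))
    S = W * inv_sqrt[:, None] * inv_sqrt[None, :]  # symmetric normalisation
    Y = np.asarray(initial_scores, dtype=float)
    F = Y.copy()
    for _ in range(num_iters):
        F = alpha * (S @ F) + (1.0 - alpha) * Y
    return F

# Hypothetical toy graph: utterances 0 and 1 sound alike, utterance 2 does not.
similarity = [[0.0, 0.9, 0.1],
              [0.9, 0.0, 0.1],
              [0.1, 0.1, 0.0]]
# Initial per-utterance scores over two candidate labels (made-up values).
# Utterance 1 is undecided on its own.
initial_scores = [[1.0, 0.0],
                  [0.5, 0.5],
                  [0.0, 1.0]]

scores = propagate_scores(similarity, initial_scores)
print(scores.argmax(axis=1))
```

In this toy case the undecided utterance inherits the preference of its acoustically similar neighbour, which is the sense in which rescoring is done collaboratively among utterances rather than individually.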
Guidelines for the use and interpretation of assays for monitoring autophagy (3rd edition)
In 2008 we published the first set of guidelines for standardizing research in autophagy. Since then, research on this topic has continued to accelerate, and many new scientists have entered the field. Our knowledge base and relevant new technologies have also been expanding. Accordingly, it is important to update these guidelines for monitoring autophagy in different organisms. Various reviews have described the range of assays that have been used for this purpose. Nevertheless, there continues to be confusion regarding acceptable methods to measure autophagy, especially in multicellular eukaryotes. For example, a key point that needs to be emphasized is that there is a difference between measurements that monitor the numbers or volume of autophagic elements (e.g., autophagosomes or autolysosomes) at any stage of the autophagic process versus those that measure flux through the autophagy pathway (i.e., the complete process including the amount and rate of cargo sequestered and degraded). In particular, a block in macroautophagy that results in autophagosome accumulation must be differentiated from stimuli that increase autophagic activity, defined as increased autophagy induction coupled with increased delivery to, and degradation within, lysosomes (in most higher eukaryotes and some protists such as Dictyostelium) or the vacuole (in plants and fungi). In other words, it is especially important that investigators new to the field understand that the appearance of more autophagosomes does not necessarily equate with more autophagy. In fact, in many cases, autophagosomes accumulate because of a block in trafficking to lysosomes without a concomitant change in autophagosome biogenesis, whereas an increase in autolysosomes may reflect a reduction in degradative activity. It is worth emphasizing here that lysosomal digestion is a stage of autophagy and evaluating its competence is a crucial part of the evaluation of autophagic flux, or complete autophagy.
Here, we present a set of guidelines for the selection and interpretation of methods for use by investigators who aim to examine macroautophagy and related processes, as well as for reviewers who need to provide realistic and reasonable critiques of papers that are focused on these processes. These guidelines are not meant to be a formulaic set of rules, because the appropriate assays depend in part on the question being asked and the system being used. In addition, we emphasize that no individual assay is guaranteed to be the most appropriate one in every situation, and we strongly recommend the use of multiple assays to monitor autophagy. Along these lines, because of the potential for pleiotropic effects due to blocking autophagy through genetic manipulation, it is imperative to delete or knock down more than one autophagy-related gene. In addition, some individual Atg proteins, or groups of proteins, are involved in other cellular pathways, so not all Atg proteins can be used as a specific marker for an autophagic process. In these guidelines, we consider these various methods of assessing autophagy and what information can, or cannot, be obtained from them. Finally, by discussing the merits and limits of particular autophagy assays, we hope to encourage technical innovation in the field.
ASR-Aware End-to-end Neural Diarization
We present a Conformer-based end-to-end neural diarization (EEND) model that
uses both acoustic input and features derived from an automatic speech
recognition (ASR) model. Two categories of features are explored: features
derived directly from ASR output (phones, position-in-word and word boundaries)
and features derived from a lexical speaker change detection model, trained by
fine-tuning a pretrained BERT model on the ASR output. Three modifications to
the Conformer-based EEND architecture are proposed to incorporate the features.
First, ASR features are concatenated with acoustic features. Second, we propose
a new attention mechanism called contextualized self-attention that utilizes
ASR features to build robust speaker representations. Finally, multi-task
learning is used to train the model to minimize classification loss for the ASR
features along with diarization loss. Experiments on the two-speaker English
conversations of Switchboard+SRE data sets show that multi-task learning with
position-in-word information is the most effective way of utilizing ASR
features, reducing the diarization error rate (DER) by 20% relative to the
baseline.
Comment: To appear in ICASSP 202
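The multi-task objective mentioned above, a classification loss for the ASR features minimized along with the diarization loss, could look roughly like the sketch below. It is not the authors' implementation: the permutation-invariant binary cross-entropy is assumed here because it is a common choice for EEND-style models, and `aux_weight`, the function names, and the toy values are hypothetical.

```python
import itertools
import numpy as np

def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over frames and speakers."""
    probs = np.clip(probs, eps, 1.0 - eps)
    return float(-(targets * np.log(probs)
                   + (1 - targets) * np.log(1 - probs)).mean())

def permutation_invariant_bce(speaker_probs, speaker_targets):
    """Minimum BCE over all orderings of the reference speaker columns."""
    num_speakers = speaker_targets.shape[1]
    return min(
        binary_cross_entropy(speaker_probs, speaker_targets[:, list(perm)])
        for perm in itertools.permutations(range(num_speakers))
    )

def categorical_cross_entropy(class_probs, class_ids, eps=1e-7):
    """Mean negative log-probability of the reference class per frame."""
    picked = class_probs[np.arange(len(class_ids)), class_ids]
    return float(-np.log(np.clip(picked, eps, 1.0)).mean())

def multitask_loss(speaker_probs, speaker_targets, aux_probs, aux_ids,
                   aux_weight=0.1):
    """Diarization loss plus a weighted auxiliary classification loss."""
    diarization = permutation_invariant_bce(speaker_probs, speaker_targets)
    auxiliary = categorical_cross_entropy(aux_probs, aux_ids)
    return diarization + aux_weight * auxiliary

# Toy example: 3 frames, 2 speakers, 3 auxiliary classes (all values made up).
speaker_probs = np.array([[0.9, 0.1], [0.8, 0.7], [0.2, 0.9]])
speaker_targets = np.array([[1, 0], [1, 1], [0, 1]])
aux_probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]])
aux_ids = np.array([0, 1, 2])

total = multitask_loss(speaker_probs, speaker_targets, aux_probs, aux_ids)
print(total)
```

Taking the minimum over speaker orderings makes the diarization term indifferent to which output column is assigned to which speaker, while the auxiliary term encourages the shared layers to retain information such as position-in-word.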